Symbolic Heuristic Search Value Iteration for Factored POMDPs

Authors

  • Hyeong Seop Sim
  • Kee-Eung Kim
  • Jin Hyung Kim
  • Du-Seong Chang
  • Myoung-Wan Koo
Abstract

We propose the Symbolic heuristic search value iteration (Symbolic HSVI) algorithm, which extends the heuristic search value iteration (HSVI) algorithm to handle factored partially observable Markov decision processes (factored POMDPs). The idea is to use algebraic decision diagrams (ADDs) to compactly represent both the problem itself and all the relevant intermediate computation results in the algorithm. We leverage Symbolic Perseus for computing the lower bound of the optimal value function using ADD operators, and provide a novel ADD-based procedure for computing the upper bound. Experiments on a number of standard factored POMDP problems show that we can achieve an order of magnitude improvement in performance over previously proposed algorithms.

Partially observable Markov decision processes (POMDPs) are widely used for modeling stochastic sequential decision problems with noisy observations. However, when we model real-world problems, we often need a compact representation, since a flat, table-based model may have a very large number of states. Factored POMDPs (Boutilier & Poole 1996) are one such compact representation. Given a factored POMDP, we have to design an algorithm that does not explicitly enumerate all the states in the model. For this purpose, the algebraic decision diagram (ADD) representation (Bahar et al. 1993) of all the vectors and matrices used in conventional POMDP algorithms has gained popularity over the years. Hansen & Feng (2000) extended the Incremental Pruning algorithm (Cassandra, Littman, & Zhang 1997) to use ADDs. More recently, Poupart (2005) proposed the Symbolic Perseus algorithm, a point-based value iteration using ADDs. Symbolic Perseus was the primary motivation for our work: a symbolic version of the heuristic search value iteration (HSVI) algorithm (Smith & Simmons 2004; 2005) that handles factored POMDPs by using ADDs in a similar manner. HSVI is also a point-based value iteration algorithm; it recursively explores important belief points for approximating the optimal value function, and to guide this exploration it computes both a lower and an upper bound on the value function (Figure 1). Our proposed algorithm (hereafter Symbolic HSVI) uses ADD operators to compute the upper and lower bounds without explicitly enumerating the states and observations of the POMDP. We describe the implementation of the core procedures and report experimental results on a number of benchmark factored POMDP problems.

Procedure: π = HSVI(ε)
  initialize V⊕ (upper bound) and V⊖ (lower bound)
  while V⊕(b0) − V⊖(b0) > ε do
    explore(b0, V⊕, V⊖, 0)
  end while

Procedure: explore(b, V⊕, V⊖, t)
  if V⊕(b) − V⊖(b) ≤ εγ^(−t) then return
  a* ← argmax_a Q_{V⊕}(b, a)
  z* ← argmax_z P(z | b, a*) · [ V⊕(τ(b, a*, z)) − V⊖(τ(b, a*, z)) − εγ^(−(t+1)) ]
  explore(τ(b, a*, z*), V⊕, V⊖, t + 1)
  V⊖ ← V⊖ ∪ {backup(b, V⊖)}
  V⊕ ← V⊕ ∪ {(b, H V⊕(b))}

Procedure: α = backup(b, V⊖)
  for each a ∈ A, z ∈ Z:  α^{a,z} ← argmax_{α′ ∈ V⊖} (α′ · τ(b, a, z))
  for each a ∈ A:  α^a(s) ← R(s, a) + γ Σ_{z,s′} α^{a,z}(s′) O(s′, a, z) T(s, a, s′)
  α ← argmax_{α^a} (α^a · b)

Figure 1: Top-level pseudo-code of HSVI
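For intuition, the backup procedure of Figure 1 can be sketched in flat (table-based) form in Python. This is only an illustrative reference implementation: the point of the paper is to carry out these same operations on ADDs instead of enumerating states and observations. The array layout (T[a, s, s'], O[a, s', z], R[s, a]) and all names below are assumptions, not the paper's code.

```python
import numpy as np

# Flat-representation sketch of the backup procedure in Figure 1.
# Assumed array layout: T[a, s, s'] = P(s'|s,a), O[a, s', z] = P(z|a,s'),
# R[s, a] = immediate reward. V_lo is a list of alpha vectors (numpy arrays).

def tau(b, a, z, T, O):
    """Belief update: b'(s') proportional to O(s',a,z) * sum_s T(s,a,s') b(s)."""
    bp = O[a, :, z] * (b @ T[a])
    return bp / bp.sum()  # assumes P(z|b,a) > 0

def backup(b, V_lo, T, O, R, gamma):
    """Point-based backup at belief b against the lower-bound set V_lo."""
    nA, nZ = T.shape[0], O.shape[2]
    best, best_val = None, -np.inf
    for a in range(nA):
        # alpha_{a,z}: the vector in V_lo that is maximal at the successor belief
        alpha_az = [max(V_lo, key=lambda al: al @ tau(b, a, z, T, O))
                    for z in range(nZ)]
        # alpha_a(s) = R(s,a) + gamma * sum_{z,s'} alpha_{a,z}(s') O(s',a,z) T(s,a,s')
        alpha_a = R[:, a] + gamma * sum(
            T[a] @ (O[a, :, z] * alpha_az[z]) for z in range(nZ))
        # keep the vector of the action that is maximal at b
        if alpha_a @ b > best_val:
            best, best_val = alpha_a, alpha_a @ b
    return best
```

In Symbolic HSVI, T, O, R, and the alpha vectors would each be represented as ADDs, so the sums and products above become ADD operations rather than loops over enumerated states and observations.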
Overview of POMDPs and HSVI

POMDPs model stochastic control problems with partially observable states. A POMDP is specified as a tuple ⟨S, A, Z, T, O, R, b0⟩: S is the set of states (s denotes a state); A is the set of actions (a denotes an action); Z is the set of observations (z denotes an observation); T(s, a, s′) is the probability of transitioning from state s to state s′ by executing action a; O(s′, a, z) is the probability of observing z after executing action a and reaching state s′; R(s, a) is the immediate reward of executing action a in state s; and b0 is the initial belief, i.e., the probability distribution over states at the start.
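To make the tuple concrete, a toy flat POMDP can be written down as numpy arrays; all probabilities and rewards below are illustrative placeholders, not from the paper, and the resulting arrays can be passed straight into the backup sketch above.

```python
import numpy as np

# A toy 2-state / 2-action / 2-observation POMDP <S, A, Z, T, O, R, b0>.
# All numbers are illustrative placeholders.
T = np.array([[[0.9, 0.1],          # T[a, s, s'] = P(s' | s, a)
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.5, 0.5]]])
O = np.array([[[0.8, 0.2],          # O[a, s', z] = P(z | a, s')
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.5, 0.5]]])
R = np.array([[ 1.0, -1.0],         # R[s, a]
              [-1.0,  2.0]])
b0 = np.array([0.5, 0.5])           # initial belief over S

def tau(b, a, z):
    """Belief update: b'(s') proportional to O(s',a,z) * sum_s T(s,a,s') b(s)."""
    bp = O[a, :, z] * (b @ T[a])
    return bp / bp.sum()

# After executing action 0 and observing z = 0, the belief shifts toward
# state 0, which makes z = 0 the more likely observation:
print(tau(b0, a=0, z=0))            # approx. [0.765, 0.235]
```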


Related papers

FHHOP: A Factored Hybrid Heuristic Online Planning Algorithm for Large POMDPs

Planning in partially observable Markov decision processes (POMDPs) remains a challenging topic in the artificial intelligence community, in spite of recent impressive progress in approximation techniques. Previous research has indicated that online planning approaches are promising in handling large-scale POMDP domains efficiently as they make decisions “on demand” instead of proactively for t...


Solving POMDPs by Searching in Policy Space

Most algorithms for solving POMDPs iteratively improve a value function that implicitly represents a policy and are said to search in value function space. This paper presents an approach to solving POMDPs that represents a policy explicitly as a finite-state controller and iteratively improves the controller by search in policy space. Two related algorithms illustrate this approach. ...


Factored Upper Bounds for Multiagent Planning Problems under Uncertainty with Non-Factored Value Functions

Nowadays, multiagent planning under uncertainty scales to tens or even hundreds of agents. However, current methods either are restricted to problems with factored value functions, or provide solutions without any guarantees on quality. Methods in the former category typically build on heuristic search using upper bounds on the value function. Unfortunately, no techniques exist to compute such ...


Heuristic Search Value Iteration for POMDPs

We present a novel POMDP planning algorithm called heuristic search value iteration (HSVI). HSVI is an anytime algorithm that returns a policy and a provable bound on its regret with respect to the optimal policy. HSVI gets its power by combining two well-known techniques: attention-focusing search heuristics and piecewise linear convex representations of the value function. HSVI’s soundness an...


Solving Factored POMDPs with Linear Value Functions

Partially Observable Markov Decision Processes (POMDPs) provide a coherent mathematical framework for planning under uncertainty when the state of the system cannot be fully observed. However, the problem of finding an exact POMDP solution is intractable. Computing such a solution requires the manipulation of a piecewise linear convex value function, which specifies a value for each possible beli...


Published in: Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI 2008)